fix(hdfs): EC-safe reads for Parquet/ORC and libhdfs3 striped reader #522

Open
zhanglistar wants to merge 2 commits into Kyligence:rebase_ch/20250729 from bigo-sg:feat/support-ec-read

  • Add PositionStripeReader.cpp to libhdfs3-cmake SRCS (fix undefined PositionStripeReader symbol when linking StripedInputStreamImpl).
  • HDFS ReadBufferBuilder: do not trust Substrait properties.filesize for Parquet/ORC (incl. Iceberg); always stat real length for footer/postscript.
  • Parquet: prefer RandomAccessFileFromRandomAccessReadBuffer when readBigAt is available so footer ReadAt uses pread, not seek+read.
  • ORC (Gluten): OrcUtil path mirrors the same Arrow RandomAccessFile choice where applicable.
  • NativeORCBlockInputFormat: decouple offset-based read (readBigAt) from use_prefetch so Gluten (use_prefetch=false) still uses pread on HDFS EC; rename flag to use_offset_based_read for clarity.
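
The second and third bullets can be illustrated with a minimal standalone sketch. It is not the code from this PR: plain POSIX `fstat` and `pread` stand in for the HDFS stat call and `readBigAt`, and the helper names `preadFully` and `readParquetFooterLength`, as well as the `hinted_size` parameter (standing in for a plan-supplied size such as `properties.filesize`), are hypothetical. It relies only on the Parquet layout in which the last 8 bytes of a file are a 4-byte little-endian footer length followed by the magic `PAR1`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: positional read that loops until `len` bytes arrive.
// pread() takes an explicit offset and leaves the file position untouched,
// so no seek is needed before the read.
static bool preadFully(int fd, char * buf, size_t len, off_t offset)
{
    assert(buf != nullptr);
    size_t done = 0;
    while (done < len)
    {
        ssize_t n = ::pread(fd, buf + done, len - done, offset + static_cast<off_t>(done));
        if (n <= 0)
            return false; // read error or unexpected end of file
        done += static_cast<size_t>(n);
    }
    return true;
}

// Hypothetical helper: returns the Parquet footer length, or nullopt on failure.
// `hinted_size` is an assumption of this sketch; it is deliberately ignored
// and the real length is taken from fstat() instead.
std::optional<uint32_t> readParquetFooterLength(const std::string & path, uint64_t hinted_size)
{
    (void)hinted_size;

    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0)
        return std::nullopt;

    std::optional<uint32_t> result;
    struct stat st{};
    if (::fstat(fd, &st) == 0 && st.st_size >= 8)
    {
        unsigned char tail[8];
        // Read the 8-byte tail at an absolute offset derived from the real size.
        if (preadFully(fd, reinterpret_cast<char *>(tail), sizeof(tail), st.st_size - 8)
            && std::memcmp(tail + 4, "PAR1", 4) == 0)
        {
            // The footer length is stored little-endian.
            result = static_cast<uint32_t>(tail[0])
                | (static_cast<uint32_t>(tail[1]) << 8)
                | (static_cast<uint32_t>(tail[2]) << 16)
                | (static_cast<uint32_t>(tail[3]) << 24);
        }
    }
    ::close(fd);
    return result;
}
```

Calling `readParquetFooterLength(path, 9999)` on a file whose real size differs from 9999 still locates the tail correctly, because the offset comes from `fstat` and not from the hint. A stale hint would otherwise place the tail read at the wrong offset.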

Changelog category (leave one):

  • New Feature
  • Improvement
  • Performance Improvement
  • Backward Incompatible Change
  • Build/Testing/Packaging Improvement
  • Documentation (changelog entry is not required)
  • Bug Fix (user-visible misbehavior in an official stable release)
  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/
